3.3 M-estimators and their consistency. A sequence of estimators $T_n$, one for each
sample size $n$, possibly only defined for $n$ large enough, is called consistent if for
$X_1, X_2, \ldots$ i.i.d. $(P_\theta)$, $T_n = T_n(X_1, \ldots, X_n)$ converges in probability
as $n \to \infty$ to a function $g(\theta)$ being estimated. This section will treat consistency
of estimators which are more general than maximum likelihood estimators in two ways: first,
the function being maximized may not be a likelihood, and second, it only needs to be
approximately maximized.

It will be assumed that the parameter space $\Theta$ is a locally compact separable metric
space with a metric $d$, such as an open or closed subset of a Euclidean space.
$(X, \mathcal{A}, P)$ will be any probability space, and $h = h(\theta, x)$ is a measurable
function on $\Theta \times X$ with values in the extended real number system $[-\infty, \infty]$.
One example will be the negative of the log likelihood function,
$h(\theta, x) = -\log f(\theta, x)$. This will be called the log likelihood case. Let
$X_1, X_2, \ldots$ be independent random variables with values in $X$ and distribution $P$,
specifically, coordinates on the countable product $(X^\infty, \mathcal{A}^\infty, P^\infty)$
of copies of $(X, \mathcal{A}, P)$ (RAP, Sec. 8.2).

A statistic $T_n = T_n(X_1, \ldots, X_n)$ with values in $\Theta$ will be called an
M-estimator if

    $\frac{1}{n}\sum_{i=1}^n h(T_n, X_i) = \inf_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n h(\theta, X_i)$.

Thus, in the log likelihood case, an M-estimator is a maximum likelihood estimator. 

At first reading, the reader may prefer to skip the next paragraph and in what fol- 
lows, interpret convergence in outer probability as convergence in probability, and almost 
uniform convergence as almost sure convergence. 

The outer probability $P^*(C)$ of a not necessarily measurable set $C$ is defined by

    $P^*(C) := \inf\{P(A) : A \supset C,\ A \text{ measurable}\}$.

Let $f_n$ be a sequence of not necessarily measurable functions from a probability space into
a metric space $S$ with metric $d$. Then $f_n$ are said to converge to $f_0$ in outer
probability if for every $\varepsilon > 0$, $P^*(d(f_n, f_0) > \varepsilon) \to 0$ as
$n \to \infty$. Also, $f_n$ are said to converge to $f_0$ almost uniformly if for every
$\varepsilon > 0$, $P^*(\sup_{m \ge n} d(f_m, f_0) > \varepsilon) \to 0$ as $n \to \infty$.
If $d(f_m, f_0)$ is a measurable random variable, as it will be in nearly all actual
applications, then almost uniform convergence is the same as almost sure convergence.

Statistics $T_n = T_n(X_1, \ldots, X_n)$ with values in $\Theta$ will be called a sequence of
approximate M-estimators if as $n \to \infty$,

    (3.3.1)  $\frac{1}{n}\sum_{i=1}^n h(T_n, X_i) - \inf_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n h(\theta, X_i) \to 0$

almost uniformly.
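The gap in (3.3.1) can be computed exactly in one simple case. The sketch below is only an illustration under assumptions not made at this point in the text: it takes $h(\theta, x) = |x - \theta|$ (an example used later in the section), standard normal data, and the sample mean as $T_n$. It relies on the standard fact that $\theta \mapsto \frac{1}{n}\sum_i |X_i - \theta|$ is minimized at a sample median. The names `avg_h` and `gap` are hypothetical helpers.

```python
import random
import statistics

def avg_h(theta, xs):
    """(1/n) sum_i h(theta, X_i) with h(theta, x) = |x - theta|."""
    return sum(abs(x - theta) for x in xs) / len(xs)

def gap(t_n, xs):
    """The difference in (3.3.1). For this h the infimum over theta is
    attained at a sample median, so it can be evaluated exactly."""
    return avg_h(t_n, xs) - avg_h(statistics.median(xs), xs)

rng = random.Random(3)                      # fixed seed, an arbitrary choice
xs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# T_n = sample mean of the first n observations (an assumed choice of estimator).
gaps = {}
for n in (10, 100, 1_000, 10_000):
    sample = xs[:n]
    gaps[n] = gap(sum(sample) / n, sample)
    print(n, gaps[n])
```

The gap is never negative, and for a symmetric law with a finite mean it shrinks as $n$ grows, so in this setting the sample mean behaves as an approximate M-estimator for $h$ even though it is not an exact one.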

It will be proved that $T_n$ converges almost uniformly to some $\theta_0$ under a list of
assumptions as follows. It is known that under some weaker conditions $T_n$ converges to
$\theta_0$ in outer probability, but for simplicity that will not be shown here.

(A-1) $h(\theta, x)$ is a separable stochastic process, meaning that there is a set
$A \subset X$ with $P(A) = 0$ and a countable subset $S \subset \Theta$ such that for every
open set $U \subset \Theta$ and every closed set $J \subset [-\infty, \infty]$,

    $\{x : h(\theta, x) \in J \text{ for all } \theta \in S \cap U\} \subset A \cup \{x : h(\theta, x) \in J \text{ for all } \theta \in U\}$.

This will be true with $A$ empty if each function $h(\cdot, x)$ is continuous on $\Theta$ and
$S$ is dense in $\Theta$, but the assumption is valid in more general situations. An alternate,
equivalent formulation of separability is that for some countable $S$ and almost all $x$, the
graph of $h(\cdot, x)$ restricted to $S$ is dense in the whole graph (Appendix C). For example,
if $\Theta$ is an interval in $\mathbb{R}$, and for almost all $x$, $h(\cdot, x)$ is either
left-continuous or right-continuous at each $\theta$, then $h(\cdot, \cdot)$ is a separable
process.

It is known that by changing $h(\theta, x)$ only for $x$ in a set of probability 0 (depending
on $\theta$), one can assume that $h$ is separable (by a theorem of Doob, Appendix C, Theorem
C.2). But in statistics, where the probability $P$ is unknown, the separability is more clearly
attainable in case $h$ has at least a one-sided continuity property as just mentioned.

Instead of continuity, here is a weaker assumption: 

(A-2) For each $x$ in $X$, the function $h(\cdot, x)$ is lower semicontinuous on $\Theta$,
meaning that $h(\theta, x) \le \liminf_{\phi\to\theta} h(\phi, x)$ for all $\theta$.

Often, but not always, the functions $h(\cdot, x)$ will be continuous on $\Theta$. Consider for
example the uniform distributions $U[\theta, \theta + 1]$ on $\mathbb{R}$ for
$\theta \in \mathbb{R}$. The density $f(\theta, x) := 1_{[\theta, \theta+1]}(x)$ is not
continuous in $\theta$, but it is upper semicontinuous,

    $f(\theta, x) \ge \limsup_{\phi\to\theta} f(\phi, x)$.

It follows that the functions $h(\theta, x) = -\log f(\theta, x)$ are lower semicontinuous (they
have values $+\infty$ for $x \notin [\theta, \theta + 1]$). This is a reason for choosing the
densities to be indicator functions of closed intervals; if we had taken
$f(\theta, x) = 1_{(\theta, \theta+1)}(x)$, then $h(\theta, x)$ would no longer be lower
semicontinuous.
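The semicontinuity claim for the uniform family can be checked numerically at the two values of $\theta$ where $h(\cdot, x)$ jumps. The sketch below is a rough illustration, not a proof: `approx_liminf` is a hypothetical helper that stands in for the liminf by sampling a few points on either side of $\theta$, and the variant density whose interval excludes its endpoints is included for contrast.

```python
import math

def h_closed(theta, x):
    """-log of the U[theta, theta+1] density 1_[theta, theta+1](x)."""
    return 0.0 if theta <= x <= theta + 1 else math.inf

def h_open(theta, x):
    """Same, but for the density 1_(theta, theta+1)(x) with endpoints excluded."""
    return 0.0 if theta < x < theta + 1 else math.inf

def approx_liminf(h, theta, x, radii=(1e-1, 1e-3, 1e-6)):
    """Crude stand-in for liminf_{phi -> theta} h(phi, x): the minimum of h
    at a few points just to the left and right of theta."""
    return min(h(theta + s * r, x) for r in radii for s in (-1.0, 1.0))

def is_lsc_at(h, theta, x):
    """Check the inequality h(theta, x) <= liminf_{phi -> theta} h(phi, x)."""
    return h(theta, x) <= approx_liminf(h, theta, x)

x = 0.5
boundary = [x - 1.0, x]   # the two values of theta where h(., x) jumps
closed_ok = all(is_lsc_at(h_closed, t, x) for t in boundary)
open_ok = all(is_lsc_at(h_open, t, x) for t in boundary)
print(closed_ok, open_ok)
```

With the closed interval, $h(\cdot, x)$ equals 0 at both jump points, so the inequality in (A-2) holds there. With the endpoints excluded, $h$ is $+\infty$ at a jump point while nearby values are 0, and the inequality fails.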

For any real function $f$, as usual let $f^+ := \max(f, 0)$ and $f^- := -\min(f, 0)$. A
function $h(\cdot, \cdot)$ of $x$ and $\theta$ will be called adjusted for $P$ if

    (3.3.2)  $E h(\theta, x)^- < \infty$ for all $\theta \in \Theta$, and

    (3.3.3)  $E h(\theta, x)^+ < \infty$ for some $\theta \in \Theta$.

To say that $h$ is adjusted is equivalent to saying that $E h(\theta, \cdot)$ is well-defined
(possibly $+\infty$) and not $-\infty$ for all $\theta$, and for some $\theta$, also
$E h(\theta, \cdot) < +\infty$, so it is some finite real number.

If $a(\cdot)$ is a measurable real-valued function on $X$ such that $h(\theta, x) - a(x)$ is
adjusted for $P$, then $h(\cdot, \cdot)$ will be called adjustable for $P$ and $a(\cdot)$ will
be called an adjustment function for $h$ and $P$. The next assumption is:

(A-3) $h(\cdot, \cdot)$ is adjustable for $P$.

From here on, if $h(\theta, x)$ is adjustable but not adjusted, let
$\gamma(\theta) := \gamma_a(\theta) := E[h(\theta, x) - a(x)]$ for a suitable adjustment
function $a(\cdot)$. As an example, let $h(\theta, x) := |x - \theta|$ for
$\theta, x \in \mathbb{R}$. If $P$ is a law on $\mathbb{R}$, such as the Cauchy distribution
with density $(\pi(1 + x^2))^{-1}$, with $\int |x|\,dP(x) = +\infty$, then $h$ itself is not
adjusted and an adjustment function is needed. Let $a(x) := |x|$ in this case. Then for each
$\theta$, $|x - \theta| - |x|$ is bounded in absolute value (by $|\theta|$), so
$\gamma(\theta)$ is defined and finite for all $\theta$. Thus $|x|$ is in fact an adjustment
function for any $P$.

The example illustrates an idea of Huber (1967, 1981) who seems to have invented the notion of
adjustment. An estimator is defined by minimizing or approximately minimizing
$\frac{1}{n}\sum_{i=1}^n h(\theta, X_i)$. If $\int h(\theta, x)\,dP(x)$ is finite, it is the
limit of the sample averages by the strong law of large numbers. But if it isn't finite, it may
be made finite by subtracting an adjustment function $a(x)$ from $h$. Since $a(\cdot)$ doesn't
depend on $\theta$, this change doesn't affect the minimization for each $n$. Thus, such
estimators can be treated for more general probability measures $P$ which on the real line, for
example, can have long tails. Allowing for such distributions is a theme of robust statistics,
e.g. Huber (1981). In fact, in the last example, $P$ can be an arbitrary (and so arbitrarily
heavy-tailed) distribution on $\mathbb{R}$.
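The two properties of the adjustment $a(x) = |x|$ used above can be checked on simulated Cauchy data. This is a minimal sketch under stated assumptions: the sample size, seed, and the small grid `thetas` are arbitrary choices, and `standard_cauchy` is a hypothetical helper that draws by inverting the Cauchy distribution function.

```python
import math
import random

def h(theta, x):
    return abs(x - theta)

def a(x):
    """Adjustment function a(x) = |x| from the text."""
    return abs(x)

def standard_cauchy(rng):
    """Inverse-CDF draw from the Cauchy law with density 1/(pi (1 + x^2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(0)           # fixed seed, an arbitrary choice
xs = [standard_cauchy(rng) for _ in range(10_000)]
thetas = [-2.0, -0.5, 0.0, 0.5, 2.0]

# Each adjusted term is bounded by |theta|; the extra term only absorbs rounding.
bounded = all(
    abs(h(t, x) - a(x)) <= abs(t) + 1e-9 * max(1.0, abs(x))
    for t in thetas for x in xs
)

def avg_raw(t):
    return sum(h(t, x) for x in xs) / len(xs)

def avg_adj(t):
    return sum(h(t, x) - a(x) for x in xs) / len(xs)

# Subtracting a(X_i) shifts every average by the same constant, so the
# minimizer over theta is unchanged.
same_argmin = min(thetas, key=avg_raw) == min(thetas, key=avg_adj)
print(bounded, same_argmin)
```

The unadjusted averages are dominated by a few very large observations, while the adjusted averages stay within $[-|\theta|, |\theta|]$; both are minimized at the same grid point.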

3.3.4 Proposition. If $a_1$ is an adjustment function for $h(\cdot, \cdot)$ and $P$, then
another measurable real-valued function $a_2(\cdot)$ on $X$ is also an adjustment function if
and only if $a_1 - a_2$ is integrable for $P$, and $\{\theta : \gamma(\theta) \in \mathbb{R}\}$
does not depend on the choice of adjustment function $a(\cdot)$.

Proof. "If" is clear. To prove "only if," we have $E((h(\theta, x) - a_i(x))^-) < \infty$ for
all $\theta$ and $i = 1, 2$, while $E((h(\theta_i, x) - a_i(x))^+) < \infty$ for some
$\theta_i$ and $i = 1, 2$. We can write for $\theta = \theta_1$ or $\theta_2$,

    $(a_1 - a_2)(x) = h(\theta, x) - a_2(x) - [h(\theta, x) - a_1(x)]$

for $P$-almost all $x$. To check this we need to take account that $h$ can have values
$\pm\infty$. For any $\theta$, $h(\theta, x) > -\infty$ for $P$-almost all $x$ since $h$ is
adjustable. We have $h(\theta_1, x) < +\infty$ and $h(\theta_2, x) < +\infty$ for $P$-almost
all $x$. Thus the given expression for $(a_1 - a_2)(x)$ is well-defined for $P$-almost all $x$
and $\theta = \theta_1$ or $\theta_2$. We then have

    $E((a_1 - a_2)^+) \le E[(h(\theta_2, x) - a_2(x))^+] + E[(h(\theta_2, x) - a_1(x))^-] < \infty$,

    $E((a_1 - a_2)^-) \le E[(h(\theta_1, x) - a_2(x))^-] + E[(h(\theta_1, x) - a_1(x))^+] < \infty$,

so $E|a_1 - a_2| < \infty$ as stated. Thus, the sets of $\theta$ for which
$E((h(\theta, x) - a_i(x))^+) < \infty$, or equivalently $E|h(\theta, x) - a_i(x)| < \infty$,
don't depend on $i$, as stated. This finishes the proof of the proposition. □

The next assumption is: 

(A-4) There is a $\theta_0 \in \Theta$ such that $\gamma(\theta) > \gamma(\theta_0)$ for all
$\theta \ne \theta_0$.

By Proposition 3.3.4, $\theta_0$ does not depend on the choice of adjustment function. After
some more assumptions, it will be shown that $T_n$ converges to $\theta_0$.

If $\Theta$ is not compact, let $\infty$ be the point adjoined in its one-point
compactification (RAP, 2.8.1) and let $\liminf_{\theta\to\infty}$ mean
$\sup_K \inf_{\theta\notin K}$ where the supremum is over all compact $K$. The next assumption
is

(A-5) For some adjustment function $a(\cdot)$, there is a continuous function $b(\cdot) > 0$ on
$\Theta$ such that

    (3.3.5)  $\inf\{(h(\theta, x) - a(x))/b(\theta) : \theta \in \Theta\} \ge -u(x)$

for some integrable function $u(\cdot) \ge 0$, and if $\Theta$ is not compact, then

    (3.3.6)  $\liminf_{\theta\to\infty} b(\theta) > \gamma_a(\theta_0)$ and

    (3.3.7)  $E\{\liminf_{\theta\to\infty} (h(\theta, x) - a(x))/b(\theta)\} \ge 1$.



This completes the list of assumptions. Here (3.3.5) and (3.3.7) may depend on the choice of
adjustment function. In the example where $X = \Theta = \mathbb{R}$,
$h(\theta, x) = |x - \theta|$ and $a(x) := |x|$, all the assumptions hold if
$b(\theta) := |\theta| + 1$ and $P$ is any law on $\mathbb{R}$ with a unique median.
Consistency, to be proved below, will imply that sample medians converge to the true median in
this case.
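The convergence of sample medians can be watched along one simulated sample path. The sketch below assumes the standard Cauchy law (true median 0, no finite mean) as the distribution $P$; the seed and sample sizes are arbitrary, and `standard_cauchy` and `avg_h` are hypothetical helper names.

```python
import math
import random
import statistics

def standard_cauchy(rng):
    """Inverse-CDF draw from the Cauchy law with density 1/(pi (1 + x^2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def avg_h(theta, xs):
    """(1/n) sum_i |X_i - theta|; a(x) = |x| is omitted since it does not
    depend on theta and so does not move the minimizer."""
    return sum(abs(x - theta) for x in xs) / len(xs)

rng = random.Random(1)                       # fixed seed, an arbitrary choice
xs = [standard_cauchy(rng) for _ in range(20_001)]

# Sample medians T_n along one sample path; the true median is 0.
medians = {n: statistics.median(xs[:n]) for n in (11, 101, 1_001, 20_001)}
for n, t_n in medians.items():
    print(n, t_n)

# The sample median minimizes the sample average of h, so nearby points do no better.
t_final = medians[20_001]
is_minimizer = all(avg_h(t_final, xs) <= avg_h(t_final + d, xs) for d in (-0.1, 0.1))
print(is_minimizer)
```

A single path is only suggestive, but it shows the behavior the consistency theorem describes: the M-estimator for $h(\theta, x) = |x - \theta|$ settles near the true median even though the sample mean of Cauchy data does not converge.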

Some consequences of the assumptions will be developed. The first one follows directly 
from Proposition 3.3.4 and the definitions: 

3.3.8 Lemma. For any adjustable $h(\cdot, \cdot)$ and adjustment function $a(\cdot)$ for it,
and any $\theta \in \Theta$ for which $\gamma_a(\theta) \in \mathbb{R}$, $h(\theta, \cdot)$ is
also an adjustment function.

A sequence of sets $U_k \subset \Theta$ will be said to converge to a point $\theta$ if
$\sup\{d(\theta, \phi) : \phi \in U_k\} \to 0$ as $k \to \infty$. Next, we have

3.3.9 Lemma. If (A-1), (A-2), and (A-3) hold and $a(\cdot)$ is an adjustment function for
which (3.3.5) holds, with $b(\cdot)$ continuous, then

(A-2') for any $\theta$, as an open neighborhood $U_k$ of $\theta$ converges to $\{\theta\}$,

    $E(\inf\{h(\phi, x) - a(x) : \phi \in U_k\}) \to \gamma(\theta) \le +\infty$.

Proof. Separability (A-1) applied to sets $J = [q, +\infty)$ for all rational $q$ and joint
measurability of $h(\cdot, \cdot)$ imply that the infimum in (A-2') is equal almost surely to a
measurable function of $x$. By (A-2), the integrand on the left converges to
$h(\theta, x) - a(x)$, and it is larger for smaller neighborhoods $U_k$, so in this sense the
convergence is monotone. Since $b(\cdot)$ is continuous and positive, it is bounded on any
neighborhood $U_k$ with compact closure, say $0 < b(\phi) \le M$ for all $\phi \in U_k$. Then
by (3.3.5), $h(\phi, x) - a(x) \ge -M u(x)$ for all $\phi \in U_k$ and all $x$. Thus the
stated convergence holds by monotone convergence (RAP, 4.3.2) for a fixed sequence of
neighborhoods of $\theta$ such as $\{\phi : d(\phi, \theta) < 1/n\}$ where $d$ is a metric for
the topology of $\Theta$. So, for any $\varepsilon > 0$, there is a neighborhood $U_k$ of
$\theta$ such that the expression being shown to converge is larger than
$\gamma(\theta) - \varepsilon$ if $\gamma(\theta)$ is finite, or larger than $1/\varepsilon$ if
$\gamma(\theta) = +\infty$, and the same will hold for any smaller neighborhood. □

Note that (3.3.1), the definition of approximate M-estimator, is not affected by subtracting
$a(x)$ from $h(\theta, x)$.

By the alternate formulation given for separability (A-1), $h(\theta, x) - a(x)$ is separable
and since $b(\theta)$ is continuous and strictly positive, $(h(\theta, x) - a(x))/b(\theta)$
is also separable.

For any adjustable $h(\cdot, \cdot)$ and adjustment function $a(\cdot)$ for it, let
$h_a(\theta, x) := h(\theta, x) - a(x)$. If (A-5) holds, this notation will mean that
$a(\cdot)$ has been chosen so that it holds.

3.3.10 Lemma. If (A-1), (A-3), (A-4), and (A-5) hold, then there is a compact set
$C \subset \Theta$ such that for every sequence $T_n$ of approximate M-estimators, almost
surely there will be some $n_0$ such that $T_n \in C$ for all $n \ge n_0$, in the sense that

    (3.3.11)  $1_{\{T_n \in C\}} \to 1$ almost uniformly as $n \to \infty$.

Proof. If $\Theta$ is compact there is no problem. Otherwise, by (3.3.6) there is a compact $C$
and an $\varepsilon$ with $0 < \varepsilon < 1$ such that

    $\inf\{b(\theta) : \theta \notin C\} \ge (\gamma(\theta_0) + \varepsilon)/(1 - \varepsilon)$.

(Note: the $1 - \varepsilon$ in the denominator is useful when
$\gamma(\theta_0) + \varepsilon > 0$ and otherwise makes little difference as
$\varepsilon \downarrow 0$.) By (3.3.5), (3.3.7), (A-1), and monotone convergence as in the
last proof, $C$ can be chosen large enough so that

    $E(\inf\{h_a(\theta, x)/b(\theta) : \theta \notin C\}) \ge 1 - \varepsilon/2$.

Then by the strong law of large numbers (RAP, Sec. 8.3), where a function with expectation
$+\infty$ can be replaced by a smaller function with large positive expectation, a.s. for $n$
large enough

    $\inf\{\frac{1}{n}\sum_{i=1}^n h_a(\theta, X_i)/b(\theta) : \theta \notin C\} \ge \frac{1}{n}\sum_{i=1}^n \inf\{h_a(\theta, X_i)/b(\theta) : \theta \notin C\} \ge 1 - \varepsilon$.

Note that the infima are measurable since by separability of $h(\cdot, \cdot)$, measurability
of $a(\cdot)$ and continuity of $b(\cdot)$, they can be restricted to a countable (dense) set
in the complement of $C$. Then for any $\theta \notin C$,

    (3.3.12)  $\frac{1}{n}\sum_{i=1}^n h_a(\theta, X_i) \ge (1 - \varepsilon) b(\theta) \ge \gamma(\theta_0) + \varepsilon$.

On the other hand, for $n$ large enough

    $\inf_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n h_a(\theta, X_i) \le \frac{1}{n}\sum_{i=1}^n h_a(\theta_0, X_i) \le \gamma(\theta_0) + \varepsilon/2$,

so as soon as the expression in (3.3.1) is less than $\varepsilon/2$, the same will hold for
$h_a$ since terms $a(X_i)$ cancel, and $T_n \in C$. □

3.3.13 Theorem. Let $\{T_n\}$ be a sequence of approximate M-estimators. Assume either
(a) (A-1) through (A-5) hold, or (b) (A-1), (A-2'), (A-3) and (A-4) hold, and for some
compact $C$, (3.3.11) holds. Then $T_n \to \theta_0$ almost uniformly.

Proof. Assumptions (a) imply (A-2') by Lemma 3.3.9, and (3.3.11) by Lemma 3.3.10. So
assumptions (b) hold in either case. By (3.3.11), $\Theta$ can be assumed to be a compact set
$C$: take any point $\psi$ of $C$ and when $T_n$ takes a value outside of $C$, redefine it as
$\psi$. It can also be assumed that $\theta_0 \in C$ by adjoining it if necessary, and the
proof below will show that $\theta_0$ had to be in $C$.

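Let $U$ be an open neighborhood of $\theta_0$. It follows from (A-2') that $\gamma(\cdot)$ is
lower semicontinuous. Thus its infimum on the compact set $C \setminus U$ is attained: let
$\theta_k$ be a sequence in $C \setminus U$ on which $\gamma$ converges to its infimum; we can
assume that $\theta_k$ converges to some $\theta_\infty$, and then $\gamma$ attains its minimum
on $C \setminus U$ at $\theta_\infty$. By (A-4),
$\inf_{C \setminus U} \gamma = \gamma(\theta_\infty) > \gamma(\theta_0)$. Let
$\varepsilon := (\gamma(\theta_\infty) - \gamma(\theta_0))/4$, or if
$\gamma(\theta_\infty) = +\infty$ let $\varepsilon := 1$. By (A-2'), each
$\theta \in C \setminus U$ has an open neighborhood $U_\theta$ such that

    $E(\inf\{h_a(\phi, x) : \phi \in U_\theta\}) \ge \gamma(\theta_0) + 3\varepsilon$.

Again, the infimum is measurable since by separability it can be restricted to a countable
dense set in $U_\theta$. Take finitely many points $\theta(j)$, $j = 1, \ldots, N$, such that
the neighborhoods $U_j := U_{\theta(j)}$ cover $C \setminus U$. By the strong law of large
numbers, as in the proof of Lemma 3.3.10, we have a.s. for $n$ large enough and each
$j = 1, \ldots, N$,

    $\inf\{\frac{1}{n}\sum_{i=1}^n h_a(\phi, X_i) : \phi \in U_j\} \ge \frac{1}{n}\sum_{i=1}^n \inf\{h_a(\phi, X_i) : \phi \in U_j\} \ge \gamma(\theta_0) + 2\varepsilon$

and $\frac{1}{n}\sum_{i=1}^n h_a(\theta_0, X_i) \le \gamma(\theta_0) + \varepsilon$. Since the
$U_j$ cover $C \setminus U$ and $\theta_0 \in U$, it follows that

    (3.3.14)  $\inf\{\frac{1}{n}\sum_{i=1}^n h_a(\theta, X_i) : \theta \in C \setminus U\} \ge \inf\{\frac{1}{n}\sum_{i=1}^n h_a(\theta, X_i) : \theta \in U\} + \varepsilon$,

so $\Pr\{T_n \in U \text{ for all } n \text{ large enough}\} = 1$. This completes the proof. □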

To apply Theorem 3.3.13 to the case of maximum likelihood estimation the following will help.
Let $P$ and $Q$ be two laws on a sample space $(X, \mathcal{B})$. Let

    $I(P, Q) := \int \log(R_{P/Q})\,dP = -\int \log(R_{Q/P})\,dP$,

called the Kullback-Leibler information of $P$ with respect to $Q$. Here we have
$R_{P/Q} = 1/R_{Q/P}$ with $1/0 := +\infty$ and $1/{+\infty} := 0$.

3.3.15 Theorem. Let $(X, \mathcal{B})$ be a sample space and $P$, $Q$ any two laws on it. Then
$I(P, Q) \ge 0$ and $I(P, Q) = 0$ if and only if $P = Q$.

Proof. By derivatives, it's easy to check that $\log x \le x - 1$ for all $x \ge 0$ (with
$\log 0 = -\infty$), with $\log x = x - 1$ if and only if $x = 1$. Thus

    $I(P, Q) = \int -\log(R_{Q/P})\,dP \ge \int 1 - R_{Q/P}\,dP \ge 0$,

with equality if and only if $R_{Q/P} = 1$ a.s. for $P$, and then $Q = P$. □
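For laws on a finite set the likelihood ratio is just a ratio of point masses, and the theorem can be checked directly. The sketch below is a specialization to that case, not the general definition: `kl_information` is a hypothetical helper, and the three probability vectors are arbitrary examples.

```python
import math

def kl_information(p, q):
    """I(P, Q) = sum_x p(x) log(p(x)/q(x)) for laws on a finite set,
    with 0 log 0 := 0 and I = +inf when q(x) = 0 < p(x)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # points of P-measure 0 contribute nothing
        if qx == 0.0:
            return math.inf     # R_{Q/P} = 0 on a set of positive P-measure
        total += px * math.log(px / qx)
    return total

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
r = [0.5, 0.5, 0.0]

i_pq = kl_information(p, q)     # distinct laws: strictly positive
i_pp = kl_information(p, p)     # identical laws: zero
i_pr = kl_information(p, r)     # P charges a point that R does not: +infinity
print(i_pq, i_pp, i_pr)
```

The third case shows why the conventions $1/0 := +\infty$ and $1/{+\infty} := 0$ are needed: the information can be infinite when $P$ is not absolutely continuous with respect to $Q$.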

Consistency of approximate maximum likelihood estimators, under suitable conditions, does
follow from Theorem 3.3.13, and assumptions (A-3), and (A-4) for the true $\theta_0$, will
follow from Theorem 3.3.15 rather than having to be assumed:

3.3.16 Theorem. Assume (A-1) holds in the log likelihood case, for a measurable family
$\{P_\theta, \theta \in \Theta\}$ dominated by a $\sigma$-finite measure $v$, with
$(dP_\theta/dv)(x) = f(\theta, x)$, so that $h(\theta, x) := -\log f(\theta, x)$. Also suppose
$P = P_{\theta_0}$ for some $\theta_0 \in \Theta$ and $P_{\theta_0} \ne P_\theta$ for any
$\theta \ne \theta_0$. Then (A-3) holds and (A-4) holds for the given $\theta_0$. Assume $T_n$
are approximate maximum likelihood estimators, i.e. approximate M-estimators in this case. If
(A-2) and (A-5) also hold, or (A-2') and (3.3.11), then the $T_n$ are consistent.

Proof. If (A-1) through (A-5) hold then (A-2') and (3.3.11) hold by Lemmas 3.3.9 and 3.3.10,
and then Theorem 3.3.13 applies. So just (A-3) and (A-4) need to be proved. Let
$a(x) := -\log f(\theta_0, x)$. We have $0 < f(\theta_0, x) < \infty$ a.s. for $P$, and so
$-\infty < \log f(\theta_0, x) < \infty$. Thus $h(\theta, x) - a(x)$ is well-defined a.s. and
equals

    $-\log(f(\theta, x)/f(\theta_0, x)) = -\log R_{P_\theta/P_{\theta_0}}$

as shown in Appendix A. Thus for all $\theta$,

    $\gamma(\theta) := E[h(\theta, x) - a(x)] = I(P_{\theta_0}, P_\theta) \ge 0 > -\infty$

by Theorem 3.3.15 and $\gamma(\theta_0) = 0$, so (A-3) holds. Also by Theorem 3.3.15,
$\gamma(\theta) = 0$ only for $\theta = \theta_0$ so (A-4) also holds. □

It appears to be simpler to treat exponential families directly than to apply the above
general theorems to them:

3.3.17 Theorem. Let $\{P_\theta, \theta \in \Theta\}$ be an exponential family in a minimal
representation, where $\Theta$ is the interior of the natural parameter space, and
$P = P_{\theta_0}$ for some $\theta_0 \in \Theta$. Then maximum likelihood estimators exist
eventually a.s. and are consistent.

Proof. As shown in Theorem 3.2.2, the existence of an MLE in $\Theta$ is equivalent to
existence of a solution in $\Theta$ of the likelihood equations
$\nabla j(\theta) = \overline{T}_n := \frac{1}{n}\sum_{j=1}^n T(X_j)$. As $n \to \infty$, the
right side converges a.s. by the strong law of large numbers and Corollary 2.5.9 to
$ET = \nabla j(\theta_0)$. The matrix of second derivatives of $j$ equals the covariance matrix
of $T$, also by Corollary 2.5.9, and it was shown in the proof of Theorem 3.2.2 that this
matrix is strictly positive definite and so nonsingular. The (Hessian) matrix of second
derivatives of $j$ gives the derivative of $\nabla j$. It follows by the Inverse Function
Theorem (Hoffman, 1975, Sec. 8.5, Theorem 7 p. 395) that $\nabla j$ is one-to-one from some
neighborhood of $\theta_0$ onto a neighborhood of $\nabla j(\theta_0)$, and has a continuous
inverse. So almost surely, eventually the likelihood equations have a solution which is an MLE
$\theta_n = (\nabla j)^{-1}(\overline{T}_n)$, and $\theta_n \to \theta_0$ a.s. by the
continuity of $(\nabla j)^{-1}$. □
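The steps of this proof can be traced in the simplest one-parameter case. The sketch below assumes the Bernoulli family written in natural form, with $T(x) = x$ and $j(\theta) = \log(1 + e^\theta)$ playing the role of the function $j$ in the likelihood equations; that identification, the success probability 0.3, and the helper names `grad_j` and `mle` are assumptions made for illustration.

```python
import math
import random

def grad_j(theta):
    """Derivative of j(theta) = log(1 + e^theta), the Bernoulli log-normalizer."""
    return 1.0 / (1.0 + math.exp(-theta))

def mle(xs):
    """Solve grad_j(theta) = mean of T(X_i), with T(x) = x.
    Returns None when the likelihood equation has no solution in Theta = R."""
    t_bar = sum(xs) / len(xs)
    if not 0.0 < t_bar < 1.0:
        return None                            # all zeros or all ones: no MLE
    return math.log(t_bar / (1.0 - t_bar))     # (grad_j)^{-1}(t_bar)

theta0 = math.log(0.3 / 0.7)     # natural parameter for success probability 0.3
rng = random.Random(2)           # fixed seed, an arbitrary choice
xs = [1 if rng.random() < 0.3 else 0 for _ in range(20_000)]

for n in (1, 10, 100, 1_000, 20_000):
    print(n, mle(xs[:n]))

theta_n = mle(xs)
print(theta0, theta_n)
```

For a single observation the sample mean is 0 or 1, which lies outside the range of `grad_j`, so no MLE exists; this is the sense of "exist eventually". Once the sample mean falls strictly between 0 and 1, the MLE is obtained by applying the continuous inverse of `grad_j`, and it approaches `theta0` as the sample mean approaches 0.3.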

Further consistency facts will be given in the following sections. 

PROBLEMS 

1. Let $h(\theta, x) = (x - \theta)^2$ for $x, \theta \in \mathbb{R}$.

(a) Show that $h$ is adjustable for a law $P$ if and only if $\int |x|\,dP(x) < \infty$.

(b) Show that then (A-4) holds and evaluate $\theta_0$.

(c) Show that for some $a(\cdot)$, (A-5) holds in this case for $b(\theta) = \theta^2 + 1$.

2. Recall that for a law $P$ on $\mathbb{R}$, a point $m$ is a median of $P$ iff both
$P((-\infty, m]) \ge 1/2$ and $P([m, +\infty)) \ge 1/2$. Thus if $P$ is a continuous
distribution without atoms, $m$ is a median if and only if $P((-\infty, m]) = 1/2$. If $P$ is
any law on $\mathbb{R}$ having a unique median $\theta_0$ and $h(\theta, x) := |x - \theta|$,
show that conditions (A-1) through (A-5) hold for some $a(\cdot)$ and $b(\cdot)$ (suggested in
the text).

NOTES 

An early result relating to consistency of maximum likelihood estimators was given by Cramér
(1946), §33.3, namely, that under some hypotheses, there exist roots of the likelihood
equation(s) converging in probability to the true value $\theta_0$. If there are multiple
roots, it was not clear how to select roots that would converge, but in case there was a
unique root and it gave a maximum of the likelihood (as with exponential families), Cramér's
theorem gave consistency of maximum likelihood estimates under his conditions.

Wald (1949) proved consistency of maximum likelihood estimates under some conditions. The
present forms of the theorems and proofs through 3.3.13 are essentially as in Huber (1967).
Dudley (1998) gave an extension, replacing the local compactness assumption by a uniform law
of large numbers assumption. Kullback and Leibler (1951) defined their information and gave
Theorem 3.3.15. Kullback (1983) gives an update. I am also indebted to Haughton (1983, 1988)
in regard to Theorem 3.3.17.